DisClose : discovering colossal closed itemsets from high dimensional datasets via a compact row-tree
نویسنده
چکیده
Data mining is an essential part of knowledge discovery, and performs the extraction of useful information from a collection of data, so as to assist human beings in making necessary decisions. This thesis describes research in the field of itemset mining, which performs the extraction of a set of items that occur together in a dataset, based on a user specified threshold. Recent focus of itemset mining has been on the discovery of closed itemsets from high-dimensional datasets, characterised by relatively few rows and a relatively larger number of columns. A closed itemset is the maximal set of items common to a set of rows. By exponentially increasing running time as the average row length increases, mining closed itemsets from such datasets renders most column enumeration-based algorithm impractical. Existing row enumeration-based algorithms also show that they struggle to reach large cardinality closed itemsets. This is due to the implementation of the support constraint, which is based on the frequency of occurrence of the itemset. Frequent closed itemsets are usually smaller in size and larger in numbers, hence taking much of the memory space. Unfortunately, large cardinality closed itemsets are likely to be more informative than small cardinality closed itemsets in this type of dataset. The research investigates the area of large cardinality closed itemset discovery by examining and analysing the literature and identifying both strengths and weaknesses of existing approaches. Based on this synthesis, a new algorithm, termed DisClose, has been designed and developed to discover large cardinality (colossal) closed itemsets from high-dimensional datasets. The algorithm strategy begins by enumerating large cardinality itemsets and from these, builds smaller itemsets. This is done by applying a bottom-up search of the row-enumeration tree. A minimum cardinality threshold has been proposed to identify colossal closed itemsets and to further reduce the search space. A novel closedness-checking method has been proposed which uses a unique generator to immediately discover closed itemsets without the need to check if each new closed itemset has previously been found. These approaches have been combined using a Compact Row-Tree (CR-Tree) data structure designed to assist in the efficient discovery of the colossal closed itemsets. For evaluation purposes four state-of-the-art algorithms have been selected for comparison. Experimental results show that algorithm DisClose is scalable and can efficiently extract colossal closed itemsets in the considered dataset, even for low support thresholds that existing algorithms cannot find.
منابع مشابه
DisClose: Discovering Colossal Closed Itemsets via a Memory Efficient Compact Row-Tree
3:00 PM Session 1 DisClose: Discovering Colossal Closed Itemsets via a Memory Efficient Compact Row-Tree Nurul F. Zulkurnain, David J. Haglin and John A. Keane Triangular Kernel Nearest-Neighbor-based Clustering Algorithm for Discovering True Clusters Aina Musdholifah and Siti Zaiton Mohd Hashim An Improved Genetic Clustering Algorithm for Categorical Data Hongwu Qin, Xiuqin Ma, Tutut Herawan, ...
متن کاملA Patricia-Tree Approach for Frequent Closed Itemsets
In this paper, we propose an adaptation of the Patricia-Tree for sparse datasets to generate non redundant rule associations. Using this adaptation, we can generate frequent closed itemsets that are more compact than frequent itemsets used in Apriori approach. This adaptation has been experimented on a set of datasets benchmarks. Keywords—Datamining, Frequent itemsets, Frequent closed itemsets,...
متن کاملA Doubleton Pattern Mining Approach for Discovering Colossal Patterns from Biological Dataset
The running time of existing algorithms in Frequent Pattern Mining (FPM) increases exponentially with increasing average data size. The existing algorithms on high dimensional datasets create large number of frequent patterns of small and mid sizes which are ineffective for decision making and shows deficiency on mining process. To discover large patterns or Colossal Patterns Doubleton Pattern ...
متن کاملMINING FUZZY TEMPORAL ITEMSETS WITHIN VARIOUS TIME INTERVALS IN QUANTITATIVE DATASETS
This research aims at proposing a new method for discovering frequent temporal itemsets in continuous subsets of a dataset with quantitative transactions. It is important to note that although these temporal itemsets may have relatively high textit{support} or occurrence within particular time intervals, they do not necessarily get similar textit{support} across the whole dataset, which makes i...
متن کاملCLAIM: An Efficient Method for Relaxed Frequent Closed Itemsets Mining over Stream Data
Recently, frequent itemsets mining over data streams attracted much attention. However, mining closed itemsets from data stream has not been well addressed. The main difficulty lies in its high complexity of maintenance aroused by the exact model definition of closed itemsets and the dynamic changing of data streams. In data stream scenario, it is sufficient to mining only approximated frequent...
متن کامل